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Abstract. It is difficult to find the optimal sparse solution of a manifold learning 
vN , based dimensionality reduction algorithm. The lasso or the elastic net penalized 

manifold learning based dimensionality reduction is not directly a lasso penal- 
r K , ized least square problem and thus the least angle regression (LARS) (Efron et 

al. 2004), one of the most popular algorithms in sparse learning, cannot be ap- 
plied. Therefore, most current approaches take indirect ways or have strict set- 
tings, which can be inconvenient for applications. In this paper, we proposed the 
manifold elastic net or MEN for short. MEN incorporates the merits of both the 
manifold learning based dimensionality reduction and the sparse learning based 
£/") ■ dimensionality reduction. By using a series of equivalent transformations, we 

^- ' show MEN is equivalent to the lasso penalized least square problem and thus 

LARS is adopted to obtain the optimal sparse solution of MEN. In particular, 
>0 ■ MEN has the following advantages for subsequent classification: 1) the local ge- 

ometry of samples is well preserved for low dimensional data representation, 
2) both the margin maximization and the classification error minimization are 
f~^. ■ considered for sparse projection calculation, 3) the projection matrix of MEN 

<^> improves the parsimony in computation, 4) the elastic net penalty reduces the 

^^ . over-fitting problem, and 5) the projection matrix of MEN can be interpreted 

psychologically and physiologically. Experimental evidence on face recognition 
over various popular datasets suggests that MEN is superior to top level dimen- 
sionality reduction algorithms. 



X 



1 Introduction 



One of the primary focuses in data mining and machine learning is finding a succinct 
and effective representation for original high dimensional samples (Hastie et al. 2009; 
Kriegel et al. 2007; Ding and Li 2007; Ding et al. 2008; Li et al. 2008; Tao et al. 2007a; 
Tao et al. 2007b). Linear dimensionality deduction is such a tool that projects the origi- 
nal samples from a high dimensional space to a low dimensional subspace. Meanwhile 
some particular information, e.g., manifold structure and discriminative information, 
of the original high dimensional samples will be well preserved while noises will be 
removed in the selected subspace. 



1.1 The state of the art 

In the past decades, a dozen of algorithms have been developed and extensive experi- 
mental results have demonstrated that duly selected subspace is effective and efficient 
for subsequent utilizations. In this paper, we categorize popular dimensionality reduc- 
tion algorithms into the following three groups: 

1. Conventional linear dimensionality reduction algorithms, e.g., principal compo- 
nents analysis (PCA) (Hotelling 1936), Fisher's linear discriminant analysis (FLDA) 
(Fisher 1936), regularized FLDA, and the geometric mean based subspace selec- 
tion (Tao et al. 2009). All of these algorithms assume samples are drawn from dif- 
ferent Gaussians. PCA maximizes the mutual information between original high- 
dimensional Gaussian distributed samples and projected low-dimensional samples. 
PCA, which is unsupervised, does not utilize the class label information. While, 
LDA finds a projection matrix that maximizes the trace of the between-class scatter 
matrix and minimizes the trace of the within-class scatter matrix in the projected 
subspace simultaneously. The same as PCA, FLDA and regularized FLDA assume 
samples are drawn from homoscedastic Gaussians. Therefore, FLDA and regular- 
ized FLDA cannot work well when Gaussians are heteroscedastic. Additionally, 
they always merge classes which are close in the high dimensional space. Although 
the geometric mean based subspace selection and its harmonic mean based exten- 
sion (Bian and Tao 2008) assume samples are drawn from heteroscedastic Gaus- 
sians and do not tend to merge close classes, they basically work for Gaussian 
distributed samples. 

2. Manifold learning based dimensionality reduction algorithms: e.g., locally linear 
embedding (LLE) (Roweis and Saul 2000), ISOMAP (Tenenbaum et al. 2000), 
Laplacian eigenmaps (LE) (Belkin and Niyogi 2001; Li et al. 2008), Hessian eigen- 
maps (HLLE) (Donoho and Grimes 2003), Generative Topographic Mapping (GTM) 
(Bishop et al. 1998; Fyfe 1) and local tangent space alignment (LTSA) (Zhang 
and Zha 2005). LLE uses linear coefficients, which reconstruct a given measure- 
ment by its neighbours, to represent the local geometry, and then seeks a low- 
dimensional embedding, in which these coefficients are still suitable for reconstruc- 
tion. ISOMAP preserves global geodesic distances of all pairs of measurements. 
LE preserves proximity relationships by manipulations on an undirected weighted 
graph, which indicates neighbour relations of pairwise measurements. LTSA ex- 
ploits the local tangent information as a representation of the local geometry and 
this local tangent information is then aligned to provide a global coordinate. HLLE 
obtains the final low-dimensional representations by applying eigen-analysis to a 
matrix which is built by estimating the Hessian over neighbourhood. All these al- 
gorithms have the out of sample problem and thus a dozen of linearizations have 
been proposed, e.g., locality preserving projections (LPP) (He and Niyogi 2004), 
neighborhood preserving embedding (NPE) (He et al. 2005), and orthogonal neigh- 
bourhood preserving projections (ONPP). Recently, we provide a systematic frame- 
work, i.e., patch alignment (Zhang et al. 2008; Zhang et al. 2009), for understand- 
ing the common properties and intrinsic difference in different algorithms including 
their linearizations. In particular, this framework reveals that: i) algorithms are in- 
trinsically different in the patch optimization stage; and ii) all algorithms share an 



almost-identical whole alignment stage. Another unified view of popular manifold 
learning algorithms is the graph embedding framework (Yan et al. 2007). Based 
on both frameworks, different algorithms have been developed, e.g., the discrim- 
inative locality alignment (Liu et al. 2008), manifold regularization (Belkin et al. 
2006) and marginal Fisher's analysis (Wang et al. 2008). 

Sparse learning based dimensionality reduction algorithms: e.g., lasso (Tibshirani 
1996), elastic net (Zou and Hastie 2005), the smoothly clipped absolute devia- 
tion penalty (SCAD) (Fan and Li 2001), Sure independence screening (Fan and 
Lv 2008), Dantzig selector (Candes and Tao 2005) and Dantzig selector with se- 
quential optimization (Dasso) (James et al. 2009). Conventional linear dimension- 
ality reduction algorithms and manifold learning based dimensionality reduction 
algorithms produce a low dimensional subspace and each basis of the subspace is 
a linear combination of all the original bases (i.e., variables or features) used for 
high dimensional sample representation. Therefore, results cannot be interpreted 
psychologically and physiologically. Sparse learning based dimensionality reduc- 
tion algorithms are developed not only to achieve the dimensionality reduction but 
also to reduce the number of explicitly used variables. A direct method to reduce 
the number of variables for representation is setting very small coefficients as zero. 
However, this strategy is problematic because small coefficients could be very im- 
portant. Because each of new bases is a linear combination of original ones, it is 
reasonable to consider each new basis as the response of several variables, i.e., the 
original features. Then the problem of sparse learning becomes a similar problem to 
variables selection and coefficients shrinkage. In linear regression, Lp norm penalty 
is always combined with the loss function to reduce over-fitting. In particular, t\- 
norm (or lasso) owns a good property to drive a good number of coefficients to zero 
and lead to a sparse model between responses and variables because of its singular- 
ity in the origin (Park and Hastie 2006; Huang and Chris Ding 2008). The number 
of lasso selected variables is no larger than the number of samples. Moreover, lasso 
randomly selects one from the group of variables that are high correlated. There- 
fore, elastic net is proposed to address the above problems and achieve the grouping 
effect by adding the li penalty to lasso. 

In recent years, sparse learning becomes popular, because: 

sparsity can make the data more succinct and simpler, so the calculation of the 
low dimensional representation and the subsequent processing, e.g., classification 
and regression, becomes more efficient. Parsimony is especially an important factor 
when the dimension of the original samples is very high and the number of samples 
is very large; 

sparsity can control the weights of original variables and decrease the variance 
brought by possible over-fitting with the least increment of the bias. Therefore, the 
learn model can generalize better; and 

sparsity provides a good interpretation of a model, thus reveals an explicit relation- 
ship between the objective of the model and the given variables. This is important 
for understanding practical problems, especially when the number of variables is 
larger than that of the samples. 



However, it is not easy to find the optimal solution of a sparse learning model. In 
the original lasso, the residue sum of squares is minimized subject to the sum of the ab- 
solute value of the coefficients being less than a constant. The quadratic programming 
is sequentially utilized to get the solution and thus the time cost is not acceptable for 
practical applications. Recently, the least angle regression (LARS) is proposed to seek 
a close form solution to the path of coefficients in each step without using the quadratic 
programming, so it is more efficient and less greedy than the original optimization al- 
gorithm used in lasso. 

Hitherto, most of sparse dimensionality reduction algorithms are designed for linear 
regression and only a few can be applied for subsequent classification, e.g., sparse prin- 
cipal component analysis (SPCA) (Zou and Hastie 2006a), Nonnegative sparse prin- 
cipal component analysis (Zass and Shashua 2007), sparse linear discriminant analy- 
sis (SLDA), sparse projections over graph (SPOG) (Cai et al. 2007; Cai et al. 2008) 
and SPCA using semi-definite programming (Aspremont et al. 2007). Both SPCA and 
SPCA using semi-definite programming do not consider the sample label information 
and thus some discriminative information will be removed after dimensionality reduc- 
tion. SLDA can work well for binary class classification but it cannot be applied for 
multi-class classification. SPOG utilizes a particular manifold learning based dimen- 
sionality reduction algorithm, e.g., locality preserving projections (LPP), to obtain the 
dense projection matrix and then applies lasso to regress the corresponding sparse pro- 
jection matrix. Absolutely the problem is indirectly formulated to obtain the sparse 
projection matrix. A direct formulation should be imposing the lasso penalty over a 
loss function (i.e., a criterion) of a dimensionality reduction algorithm. However, it is 
difficult to use LARS to obtain its optimal solution because the objective function is 
not a direct regression problem. Therefore, researchers currently take indirect routs to 
obtain sparse projection matrices. 



1.2 The proposed approach 

In this paper, we propose the manifold elastic net (MEN), which obtains a sparse projec- 
tion matrix for subsequent classification. MEN directly imposes the elastic net penalty 
(i.e., the combination of the lasso penalty and the ^2-norm penalty) over the loss (i.e., 
the criterion) of a discriminative manifold learning based dimensionality reduction al- 
gorithm. By using a series of complex linear algebra equivalent transformations, the 
objective function of MEN can be rewritten as a lasso penalized least square problem 
and thus LARS can be applied to obtain the optimal sparse solution of MEN. 

In detail, we first apply the part optimization of the patch alignment framework to 
encode the local geometry of a set of training samples. In the second step, the whole 
alignment of the patch alignment framework is applied to calculate the unified coordi- 
nate system for local patches obtained in the first step. For low dimensional data repre- 
sentation, the linearization or the linear approximation is adopted in MEN. Although we 
can impose some discriminative information preservation criterion (e.g., margin maxi- 
mization) over the part optimization stage, it is not directly relevant to the classification 
error minimization. Therefore, we put a new item that minimizes the classification error 
in the third step. To obtain a sparse projection matrix with the grouping effect, in the 



fourth step, the elastic net penalty is adopted in MEN. So far, the objective function of 
MEN is fully constructed. 

With the well defined MEN, we then apply LARS to obtain the optimal solution of 
MEN. We transform MEN into a form in which the correlation of basis can be written 
as the correlation of coefficients. Active set is built according to LARS. In each step, no 
more than one element of the basis is added to the active set according to its correlation. 
All elements in the active set are changed in each step with special direction and dis- 
tance in the space of coefficients. The direction and distance of a path in each step have 
closed form solution according to the extended simplex. The sparsity of the projection 
matrix is controlled by the cardinality of the active set. Because the LARS for MEN 
generates bases in an independent way, the same procedure is conducted multiple times 
to obtain a set of bases. Under this procedure, these bases are orthogonal. Thorough 
experiments on face recognition (Shakhnarovich and Moghaddam 2004) task based on 
popular face datasets show the effectiveness of the proposed MEN by comparing against 
the top level dimensionality reduction algorithms. 

The rest of the paper is organized as follows. Section 2 presents the proposed man- 
ifold elastic net (MEN) including the objective function of MEN and the LARS opti- 
mization for MEN. Section 3 shows the effectiveness of MEN for face recognition over 
different face datasets. Section 4 concludes. 



2 Manifold Elastic Net 

Consider in the discriminative dimensionality reduction problem with training sam- 
ples and corresponding class labels. Let X = [xi,x%, ■ ■ ■ ,x n ] € M. nxp be a given 
training set in a high dimensional space M™ xp and C — [c\,C2, • • • ,c n ] <G K™ be 
the corresponding class label vector. The objective here is to find a projection matrix 
W = [wi,W2, • • • , Wd] € R pxd that projects samples x T € R p in the high dimen- 
sional space onto a low dimensional subspace, i.e., z T = x T W, such that samples from 
different classes can be well separate, i.e., the classification error can be extremely min- 
imized. 

Manifold learning based dimensionality reduction aims to find the corresponding 
low dimensional representation z in a low dimensional Euclidean space of x to preserve 
(actually approximate) the data intrinsic structure. Popular manifold learning based di- 
mensionality reduction algorithms, however, have the following two problems: 1) the 
classification error is not directly and explicitly considered, although some algorithms 
compound discriminative information preservation criteria, e.g., margin maximization; 
and 2) the obtained low dimensional representation linear combines of all variables in 
the high dimensional space, so it is difficult to clear interpret and efficiently represent 
data. 

Sparse learning provides sparse data representation via variable selection, and has 
the following advantages: 1) the sparsity improves the parsimony in computation, i.e., 
the computational cost can be significantly reduce; 2) the penalties and the constraints 
introduced in a learning model discourage the possible over-fitting of the model; and 3) 
the learned model can be well interpreted. However, existing sparse learning algorithms 



are designed for linear regression problems and the data intrinsic structure is usually 
ignored. 

To achieve the merits of manifold learning based dimensionality reduction and the 
advantages of sparse learning, in this paper, we propose the manifold elastic net (MEN), 
which is a general framework to obtain the sparse solution of the manifold learning 
based discriminative dimensionality reduction. There are few research results on com- 
bining sparse learning and discriminative dimensionality reduction because the projec- 
tion matrix of a lasso penalized model cannot be obtained directly by using the least 
angle regression (LARS). 

MEN is not a direct combination of the manifold learning based dimensionality 
reduction and the sparse learning. It however finds the optimal sparse solution of ev- 
ery manifold learning based discriminative dimensionality reduction algorithm via the 
patch alignment framework and a new classification error minimization based criterion. 
In particular, MEN encodes the local geometry of a set of samples and finds an aligned 
coordinate system for data representation under the patch alignment framework; MEN 
utilizes the classification error minimization criterion to directly link the classification 
error with the selected subspace; and MEN incorporates the elastic net regularization to 
sparsify the projection matrix. 

2.1 Part optimization 



Different manifold learning algorithms encode different types of local geometry of sam- 
ples, e.g., locally linear embedding (LLE) applies linear coefficients to reconstruct a 
sample by its neighbors. The patch alignment framework has well demonstrated that 
different algorithms have different optimization criteria to encode different local geom- 
etry over patches. 

In MEN, the same as the part optimization in the patch alignment framework, each 
patch is constructed by a particular sample Xi and its k related ones x^ , x^ , • • • , Xi k . 
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The patch is denoted by Xi 

linear mapping /j that projects the patch Xi 

M. d , i.e., fi : X t i — > Z u where Z { = [zf, z?,zf 2 , ■ ■ ■ , zf k ] T € K( fe+1 ) xd . The part 

optimization maximizes the similarity of the local geometry represented by Xi and that 

described by Zi\ 



argmintr yZ i LiZi 



(1) 



where Li <G R( fe+1 ) x ( fe+1 ) encodes the local geometry of the patch Xi and it is different 
over different dimensionality reduction algorithms. 

For a given sample xu its k related ones are divided into two groups: the k\ ones in 
the same class with Xi and the k^ ones from different classes with x%. These two groups 
are selected independently and denoted by {2^1 , xp , ■ ■ ■ , x^ } and {x^ , Xi 2 , ■ ■ ■ , Xi k } 
respectively. Therefore, the patch for Xi is defined by 



X t = 



Xi , Xji , Xj2 , • • • , X^k-i , x^ 1 , Xi 2 , • • • , £j fcl 



t,(fci+fc 2 + l)xp 



The corresponding the low dimensional representation is 



Z i , Zjl , Zj2 , • • • , Z^k-i , Z i 1 7 Z i 2 7 ' ' ' 7 Z i k± 



n(fcl + fc 2 +l)xd 



Let Fi = {ifi 1 , i 2 , ■ ■ ■ , i kl ,ii, i<i, • • • ,ik 2 } to be the index set. In the low dimensional 
subspace, we expect that the distances between the given sample and the group of re- 
lated samples from different classes are as large as possible, while the distances between 
the sample and the group of related samples in the same class are as small as possible. 
Therefore the part optimization is: 
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where k is a trade-off parameter to control the impacts of the two parts. Define the 
coefficient vector: 



fci 



k 2 



-i T 



i.^ rv, K/j •••; rZ 



(3) 



then we can obtain the part optimization matrix 



u = 



EK1+K2 I \ T 

-Wj diag (wj) 



(4) 



2.2 Whole alignment 

Each patch Xj for 1 < i < n has a corresponding low dimensional representation Zi 
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zi , zi, , zi 
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To unify all low dimensional patches Z, 

for 1 < i < n together into a consistent coordinate system, according to the patch 

alignment framework, we assume that the coordinate of Zi is selected from the global 



coordinate Z = \z[, zj, • • • ,z^ 

jre(fci+Ai2 + l)xn. 
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by a using sample selection matrix Si G 



Zi — Zbi, 
where the selection matrix Si is defined by 



(S t 



1, if q = Fi{p}; 



'■pi \ 0, else. 
According to Eq.5, the part optimization defined in Eq.l can be rewritten as: 

argmintr {Z Si LiSiZ) . 



(5) 



(6) 



(7) 



After summing over all part optimizations together, the whole alignment is given by: 

n 

argmin ^ tr (Z T SfL l S l Z) 



z 



= argmintr Z T ^(SfL l S i ) Z 



\ i=l / 

= a,rgmmti (Z T LZ) , (8) 

where L is the alignment matrix. It is obtained by an iterative procedure: 

L{F i ,F i )^L{F i ,F i )+L i . (9) 

It is worth emphasizing that the mapping / : X i-*- Z from the high dimensional 
space to the low dimensional subspace can be nonlinear and implicit. However, the 
linear approximation Z = XW is adopted, i.e., we expect the difference between Z 
and XW is minimized. In particular, W = [w\, W2, • • • , Wd] <E ]R pxd . Therefore, the 
objective function is: 

argmin tr (Z T LZ) + j3\\Z - XW\\\. (10) 

2.3 Classification error minimization 

In MEN, although the discriminative information for classification is considered duly 
in Eq. 10, the classification error is not directly modeled. To further enhance the perfor- 
mance of MEN for classification problems, it is necessary to provide an explicit way 
to represent the classification error minimization in the objective function. The least 
square error minimization is usually adopted in binary classification, 

argmin \\Y - XW\\ 2 2 . (11) 

However, it is very challenging to apply Eq.ll to multi-class classification. This is 
mainly because the class label vector C cannot be directly utilized as the output (re- 
sponse) Y . 

Recently, the least squares linear discriminant analysis (Ye 2007; Sun et al. 2008) 
or LS-LDA for short is proposed and presents the equivalence relationship between 
the least square formulation and the conventional linear discriminant analysis (LDA) 
for multi-class classification under a mild condition. However, the dimension of the 
indicator matrix is the number of classes c. Therefore, LS-LDA can only reduce the 
original data to a c — 1 dimensional subspace. It is pretty fine when samples are drawn 
from homoscedastic Gaussians because the Bayes optimal is achieved iff the dimension 
of the subspace is c — 1. However, for practical applications, samples are usually not 
sampled from homoscedastic Gaussians and a dozen of experimental evidences show 
that we usually achieve the best classification performance in a subspace lower than 
c — 1 when c is large. 



In this paper, we propose a flexible method to design the indicator matrix Y and the 
dimension of the selected subspace is allowed to be any number between 1 and c — 1. 
In comparing with LS-LDA, the proposed indicator design method is more flexible 
and powerful to gain a lower dimensional representation and higher recognition rate. 
Therefore, the new method meets most demands for practical applications, e.g., face 
recognition. 

The nearest-neighbor (NN) rule is commonly applied in classification problmes. In 
NN, it would be perfect when samples in the same class are projected onto the same 
point after dimensionality reduction, and this point is the low dimensional representa- 
tion of the corresponding class center. Meanwhile the variance of these projected class 
centers is expected to be maximized. As a consequence, the low dimensional projec- 
tion of class centers can be conveniently obtained by the weighted principal component 
analysis (PCA). 

In detail, suppose the given n samples belong to c classes, and there are c, samples 
in the i th class. The i th class center is Oj = (1/c,) Y^iT=i x h wnerem x j is the j th 
sample in the i th class and is a row vector in MP. The proportion of the i th class is 
Pi = Ci/n. Therefore, the weighted covariance matrix of class centers is given by: 

rn 

V = Y J ViO T i o i . (12) 

i=l 

Suppose we expect to find a d dimensional subspace. The d eigenvectors associated 
with the largest d eigenvalues i] — [771 , 772, • • • ,i]d\ of "K are selected to calculate the 
low dimensional representation of the class center Oi according to 

6i = Oil]. (13) 

Therefore, the indicator matrix Y = [j/i, t/2, ■ ■ ■ , y n ] is given by yj — 6^. On combin- 
ing Eq.10 and Eq.ll, we have 

argmin||y - XW\\j + atr (Z T LZ) + /3\\Z - XW\\\, (14) 

where a and f3 are trade-off parameters to control the impacts of different parts. 



2.4 Elastic net penalty 

In MEN, we expect to obtain a sparse projection matrix for explicit data representa- 
tion and effective interpretation, i.e., control the number of nonzero elements in each 
column of the projection matrix. This nonzero number of the entries of the projection 
matrix can be characterized by the ^o- norm of the projection matrix. We can impose 
it over the objective function defined in Eq.14 as a penalty. However, it turns to be an 
NP-hard problem and thus it is always impossible to be solved in a polynomial time, 
because the penalty is nonconvex (Lv and Fan 2009). Therefore, the ^i-norm of the 
projection matrix, i.e., lasso, is usually adopted as a relaxation of the £0 penalty. Al- 
though lasso is convex, it is difficult to find the solution of the lasso regularized model. 
This is because the lasso term is not differentiable. Least angle regression or LARS for 
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short has been proposed to greedily search the optimal solution of the lasso penalized 
linear regression problem. LARS continuously shrinks the particular coefficients (en- 
tries of the projection matrix W) towards zeros, while simultaneously preserves high 
prediction accuracy. 

However, the lasso penalty has the following two disadvantages: 1) the number of 
selected variables is limited by the number of observations and 2) the lasso penalized 
model can only selects one variable from a group of correlated ones and does not care 
which one is selected. By imposing an i^-norm of the projection matrix on the lasso 
penalized problem, similar to the elastic net, we can overcome the aforementioned two 
disadvantages and retain the favorable properties of the lasso penalty. In detail, the £2- 
norm of the projection matrix is helpful to increase the dimension (and the rank) of the 
combination of the data matrix and the response. In addition, the combination of the £\ 
and £2 of the projection matrix is convex with respect to the projection matrix and thus 
the obtained projection matrix has the grouping effect property. 

Therefore, to obtain a sparse projection matrix W with the grouping effect, both 
£i-norm and ^2-norm of the projection matrix are added as penalties to the objective 
function defined in Eq. 14 and we obtain the full definition of MEN: 



argmin \\Y - XW\\\ + atr (Z T LZ) + (3\\Z - XW\\\ + Ai||W||i + \ 2 \\W\ 



2.5 LARS for MEN 



(15) 



It has been demonstrated that LARS is effective and efficient to find the optimal so- 
lution of the lasso or the elastic net (the combination of £\ and £2) penalized multiple 
linear regression. Therefore, it can be directly applied to penalized least squares only. 
However, the proposed MEN defined in Eq.15, at the first glance, is not a penalized 
least square. 

In this Section, we detail utilizing LARS to obtain the optimal solution of MEN. 
Although LARS is designed to solve the penalized multiple linear regression where 
the coefficients are a vector rather than a matrix, the column vectors of the projection 
matrix W in MEN are independent bases. Therefore, we can calculate them one by 
one. In the following analysis, we consider a particular column of W, i.e., Wi, and the 
corresponding vector y, in the indicator matrix Y, To simplify the notations below, we 
keep using W and Y instead of Wi and j/j. 

Because the low dimensional representation Z and the projection matrix W are 
independent, we can eliminate Z in the objective function. In detail, Z is obtained by 
setting the differentiate of the objective function F with respect to Z as 0, i.e., 

— = a (L + L T ) Z + 2/3 (Z - XW) = 0. (16) 

oZ ' 



Therefore, we have 



Z = /3{aL + /3I) 1 XW. (17) 
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According to Eq.17, we can eliminate Z in the objective function defined in Eq.15, and 
thus we have: 

argmin W T X T AXW - 2W T X T Y + \AW\U + A 2 ||W|||. (18) 

z.w 

where this A is an asymmetric matrix computed from L: 

A =a ((3 (aL + fiiy 1 ) L ((3 (aL + fiiy 1 ) + 

(1 (j3 (al + pi)' 1 - j) (p{aL + piy l -I^+I. (19) 

To apply LARS to obtain the optimal solution of Eq.18, we expect the first item in 
it to be a quadratic form. Because 2X T AX — X T (A + A T ) X and the eigenvalue 
decomposition of (A + A T ) /2 can be written as UDU T , the objective function defined 
in Eq.18 without the elastic net penalty can be rewritten as: 



W T X T AXW - 2W T X T Y 
-W T X T (d 1/2 u t ) (d 1/2 u t ) XW - 2W T X T (d 1/2 u t 



D l ' 2 U T 



Y 



D l ' 2 U T 



Y - (d 1/2 U t ) XW 



(20) 



The constant item can be ignored in optimization without loss of generality. We further 

set 



x* = (i + \ 2 r 1/2 



(D^U?) X 
v / A 2 7 pxp 



(n+p)xp and 



(21) 
(22) 



V* = 



(D l l 2 U T Y ) Y 
r xi 



in Eq.18, and then we get 



argmin II Y* - X*W* 

w* 



• (n+p)xl 



X\\W* 



(23) 



(24) 



where A = Ai/ (1 + A 2 ) and W* = VI + \ 2 W. 

According to Eq.24, the LARS algorithm can be applied to obtain the optimal so- 
lution of the proposed MEN. LARS provides an efficient algorithm to solve the lasso 
penalized multiple linear regression. 

Below we sketch LARS for the transformed MEN defined in Eq.24 and provide 
novel viewpoints to LARS, which are helpful to better understand the proposed MEN. 

We begin with a coefficient vector W* (a column in the projection matrix with i th 
entry (W*) i with all zero entries. A variable (a column vector in X, i.e., a particular 



12 

feature) in R™ is most correlated with the objective function is added to the active set 
A. Then the corresponding coefficient in W* increases as large as possible until a sec- 
ond variable (another column vector in X, i.e., another feature) in R™ has the same 
correlation as the first variable. Instead of continuously increasing the coefficient vector 
in the direction of the first variable, LARS proceeds on a direction equiangular over 
all variables in the active set A until a new variable earns its way into A. To make 
the coefficient vector W* becomes i^-sparse (at most K nonzero entries), we conduct 
the above procedure for K loops. The optimization path direction and the correspond- 
ing path length (step size) in LARS are determined by the correlations, which are the 
negative gradient of the objective function defined in Eq.24 without the lasso penalty, 
i.e., 

C = -g^ = 2 ( X *f ( Y * - X * W *) = [ci, C2, ■ ■ ■ , c P f • (25) 

The constant 2 can be simply ignored without loss of generality in the following analy- 
sis. 

The larger the correlation q is, the more important the corresponding variable will 
be, and thus the larger the corresponding coefficient (W*) i in W* will be. In sparse 
learning, important variables are added to the active set A sequentially according to 
their corresponding correlations defined in Eq.25, and then the direction and distance 
of coefficient vector of all the important variables are determined. 

Let A be the active set of "most correlated" variables whose coefficients are nonzero, 
while the other variables form an inactive set /. Thus the sparsity is determined by the 
cardinality of A. The correlations of variables in A are always identical to each other in 
A and larger than the correlations of variables in /. Those correlations of variables in 
/ are usually different to each other. Initially, all the variables are in inactive set I and 
thus the corresponding coefficients are all zero. 

To make W* i^-sparse, we need to conduct the following three steps for K loops. 
In the first step, the variable in the inactive set / with the largest correlation is added to 
the active set A, i.e., 

C = max{|cj|} and A = U : \c \ = c\ , (26) 

where dj is the current correlation of the j th variable. 

In the second step, the direction of the coefficient vector W* is calculated. To make 
the optimization more global and less greedy, the correlations of the active variables 
are required to decrease equally in preferred direction. In the k th loop, if the direction 
vector is u>, then the current correlation is given by 

C k ={X* A f {Y* -X*W£) 

= {X* A ) T (Y* - X* (WU+ P")) 
=C k -i+p(X* A ) T X* A WA, (27) 

where X A contains all variables in A and each its column is sampled from X*, C k -i 
is the correlation in the (k — l) th loop, p is a constant that is irrelevant to the direction 
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computation, uia stores directions associated with variables in A, and the change of the 
correlation at this step is (X A ) X a u>a- The sign of ua> i- e -> s , is identical to that of 
Cfe_i, so we can calculate the magnitude of uja directly and then assign its sign as s. 
This X a u>a is an extended simplex with vertices defined by active variables. We project 
the i th column of X*, i.e., (X*) t , onto X a u>a and thus we get (X*)J X a loa- Because 
the correlations of the active variables are required to decrease equally in preferred 
direction, i.e., (X*) i X a uja equals to each other over the index i, the only possible so- 
lution of X a lua is the normal vector through the origin in the simplex space. Therefore, 
we have 

uj A = s-(X A T X A y 1 \ A = s-G A 1 \A, (28) 

where Ga = X* A ' X* A is the Gram matrix of X A . In LARS (Efron et al. 2004), lu a is 
obtained by minimizing the squared distance between the point X a loa on the simplex 
and the origin, subject to ||uu||i = 1. 

To normalize the change of the correlation X a t X a uia to a unit vector ua, we need 
to update Aa and u>a, and thus we obtain a normalized ua, i-e-, 



-1/2 



A A =«- [1 A G-/1 A ) , (29) 

wa*-s- AaG^Ia and (30) 

ua<-X* a u a . (31) 

In the third step, we calculate the distance or magnitude of changes p\. To have 
an efficient optimization procedure, this p\ should be as large as possible. At the same 
time, we have to guarantee that correlations of variables in A are always identical to 
each other in A and larger than correlations of variables in /. Therefore p is increased 
until the correlation of a particular variable in / is equivalent to the correlations of active 
variables, i.e., 

. ( C-c, C + c,- ] 

pi = min+ AC I 3 - , 3 ) , (32) 

Jfc/1 y A A - aj A A + a,j J 

where Ac is the complement of A, a = X a t ua, cij is the j th entry of a, C is the largest 
correlation defined in Eq.26 and obtained in the first step, and p\ is a possible candidate 
of p mentioned in Eq.27. 

According to LARS, to obtain an identical solution to MEN defined in Eq.24, the 
lasso modification is considered, i.e., the argument of the distance p stops increasing 
when a coefficient of variables in A is zero, or mathematically, 

WX k = WX k _ 1 +P2SA0JA = 0, (33) 

where pi is another possible candidate of p defined in Eq.27. According to Eq.33, we 
can obtain 

p 2 = min+ {-WXk-JsAUA} ■ (34) 
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Therefore, the distance of W*, i.e., p, is the minimum of p\ and p 2 , i.e., 

p = mm + {p 1 ,p 2 }. 



(35) 



In each loop, one new variable is added to the active set A according to Eq.26, the 
direction and distance of the coefficient vector W* are calculated according to Eq.31 
and Eq.35. After K loops, W* is i^-sparse. According to the elastic net, to eliminate 
the double shrinkage, the optimal W should be corrected: 



w = y/l + \ 2 W* 



(36) 



2.6 Fast LARS 

LARS is inefficient when the size of the training set is large, because the time cost for 
calculating the inverse of the Gram matrix Ga defined in Eq.32 is huge. Because the 
dimension of this Ga is increasing at each of the K loops, according to (Golub and Van 
Loan 1996), the inverse of Ga can be obtained incrementally, i.e., the inverse of the 
Gram matrix (GU fe ) in the k loop can be updated from (G J 4 fc _ 1 ) in the previous 
loop. Particularly, in the k th loop, a new variable (X), £ R" is added to the active set 
A, and thus we have 



G A k 



^^A k ^A k 



Ak-l 
.(*)f. 



X-Av-X-Ak 



2\ 2 I 

[X Ak _ 1 (X) l ]+2X 2 I 



2\ 2 I 



X A k _ 1 X A k ^ 1 X\ ki (X) t 

(X^Xa^ (X)f(X). 

xl k _ i x Ak _ 1 + 2\ 2 i x T Aki {x) i 

(X^Xa^ (X)J(X) t + 2\ 2 



(37) 



Let A, B, C and D be the blocks of G A , i.e., A ■■ 

ixfAX), 



XI^Xa^ + 2\ 2 I, B = 
2\ 2 . Let Sa to be the Schur 



complement of A, i.e., Sa = D — CA~ 1 B. According to rules of the block matrix 
calculation, (GU fe 



{Gj 



is given by: 



A^BS^-CA- 1 -A^BSa 1 



J A O -Tl 



c-1 
°A 



(38) 



where A^ 1 = (G^.J is the inverse of the Gram matrix obtained in the previous 
loop. The time cost for calculating the inverse of the Gram matrix in the k th loop can 
be reduced from 0(p 3 ) to 0(p 2 + 5p) (p is the size of active set in the k th loop) when 
the inverse of the Gram matrix in the previous loop is available. 

We can further accelerate the computation of LARS for MEN by taking the ad- 
vantage of the sparse structure of X*. For example, when calculating the equiangular 
vector a and the inner product Ga, the block matrix calculation can reduce the time 
cost as well. 
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2.7 Algorithm 

In this paper, we propose an efficient framework MEN for discriminative dimensionality 
reduction with sparse projection. Based on the discussion in the above six subsections, 
MEN is shown in Algorithm 1 . 

Algorithm 1 Manifold Elastic Net (MEN) 

Input: Training data matrix X = [xi, Xi, ■ ■ ■ , c„] £ M. nxp ; 
Class label vector C — [ci, Ci, ■ ■ ■ , c n ] ; 

W = [wi , W2, ■ ■ ■ , Wd] G px , where d is the dimensions of subspace; 
The number of loops K, small K induces sparser W. 
Output: Sparse projection matrix W — [wi, W2, ■ ■ • , Wd] G R pxd . 
Initialize: k := 0. 
repeat 

Step 1: Optional PC A reconstruction of original data X. 

Step 2: Part optimization: build n patches for the n given samples according to definition of 

manifold, calculate matrix Li for each patch using Eq.3 and Eq.4. 
Step 3:Whole alignment: unify the patches in a global coordinate, compute big matrix L 

using Eq.9. 
Step 4: Classification error minimization: Calculate the indicator matrix Y using scaled 

PC A for class centers using Eq.13. 
Step 5: New data matrix and indicator matrix: Calculate X* and Y* from X and Y using 

Eq.22 and Eq.23. 
Step 6: Column by column loops for W,k :— k + 1. 
Initialize: m := 0. 
repeat 

m := rn + 1. 

Update active set: add the variable with largest correlation to A using Eq.25 and Eq.26. 
Direction calculation using Eq.30, Eq.31 and fast LARS Eq.38. 
Distance calculation using Eq.32, Eq.34 and Eq.35. 
Update Wh using Eq.36. 
until m=K. 

Step 7: Update projection matrix W by adding w^ into W. 
until k=d. 
return W . 



In MEN, after necessary initializations, we first build patches for all training sam- 
ples by calculating Li of each patch in the part optimization according to Eq.4 in sub- 
section 1. Then these Li matrixes are unified in a global coordinate system into one 
matrix L according to Eq.9 in whole alignment step explained in subsection 2. After- 
wards, the indicator matrix Y is computed according to the weighted PCA over class 
centers defined in Eq.13 in subsection 3. A matrix A defined in Eq.15 in the objective 
function can be obtained from L and other parameters. The eigenvalue decomposition is 
conducted over (^4 + A T ) /2 to construct the new data matrix X* and the new indicator 
matrix Y* according to Eq.22 and Eq.23, respectively. 

Then the LARS algorithm is applied to calculate a sparse projection matrix. The 
direction and distance of each loop are computed according to Eq.31 and Eq.35. The 
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incremental method to obtain the inverse of the Gram matrix defined in Eq.38 is con- 
sidered speeding up LARS. This process is conducted several times and the projection 
matrix is computed column by column. Finally a sparse projection matrix is obtained 
as the output of MEN. This matrix is ready to project a given sample in MP to a low 
dimensional subspace M d with if-sparse. 

MEN is an efficient algorithm with high convergence velocity, because the computa- 
tion in LARS explained in subsections 5 and 6 is equivalent to the cost of a least square 
fit. Given a training set X £ R raxp , to obtain a sparse matrix W £ MP xd each column of 
which contains K nonzero elements, d times of LARS are required in MEN. Most steps 
in LARS are simple matrix computations. For p>n, MEN requires 0(dK 3 + dpK 2 ) 
operations. 

2.8 Discussions 

MEN integrates the merits of both manifold learning and sparse learning via a unified 
framework. It is not a direct combination of these two popular learning schemes but a 
complementary embedding of both. Through the patch alignment framework, the local 
geometry of a given dataset is retained in MEN. The weighted lasso and £2 penalties 
are added to produce a sparse projection matrix with the grouping effect. The com- 
bined lasso and £2 is also termed as the elastic net. Therefore, we term the proposed 
framework as the manifold elastic net. As a consequence, MEN is superior to existing 
dimensionality reduction algorithms, because of its powerful variable selection function 
and consideration of the intrinsic structure of the dataset. 

It has been well demonstrated that LARS is effective and efficient to solve a lasso 
regularized least square problem. Therefore, to apply LARS to find the optimal solution 
of MEN, it is essential to prove that MEN is equivalent to a lasso regularized least square 
problem and LARS converges for optimization. In particular, we prove that LARS can 
optimize a general form of the lasso regularized problem, which contains both MEN 
and the lasso regularized least square problem as special cases. 

Theorem 1. LARS can solve a general form of the lasso regularized problem defined 
below: 

aigmm (3 T Af3 + p T B + C + t\\0\\i, (39) 

where j3 £ R pxl and A £ MP xp (could be an asymmetric square matrix), B £ M pxl , 
and C and t are constants. 

Proof. It is equivalent to prove that the problem defined in Eq.39 is equivalent to a lasso 
regularized least square problem. 

The objective function defined in Eq.39 without the lass penalty can be written as: 

/3 T Ap + p T B + C = [3 T [ + 2 A ) 13 + P T B + C, (40) 

where (^4 + A T ) /2 £ M pxp is a symmetric matrix and its eigenvalue decomposition 
is (A + A T ) /2 - UDU T . 
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Therefore, we have: 



pT (A+^_\ /3 + p T B+r 



=0 r {D 1 / 2 U T ) ( D ^ 2 U T )p- 

2/3 T ( D l/2jj 



'' T Uf(j) 1/2 /'' 



D 



C 



D l ' 2 U T 



B - {D l ' 2 U T ) (3 



const. 



To simply represent the above objective function, without loss of generality, let 

Y = -U(d 1 ' 2 U t ) T \ B.X^^D 1 ' 2 ^), 



(41) 



(42) 



and ignore the constant. Therefore, we can transform the problem defined in Eq.39 to 



argmin||F-X/3||2 + t||/3||i, 



(43) 



which is a lasso regularized least square problem. It is not difficult to prove that MEN 
is a special case of the problem defined in Eq.39. Therefore, LARS can be applied to 
solve MEN and the problem defined in Eq.39. 

Theorem 2. LARS converges in optimizing the problem defined in Eq.39 in Theorem 1. 

Proof. Let the objective function defined in Eq.39 without the lasso penalty be F. After 
the k th loop, assume the estimate of the objective function becomes F^. If F is smooth 
in each loop, we have: 

(dF k 



F k -F, 



fc-i 



dF k 



dfr 



max 



dF k 
dF k 



Pi=p; 



&i=K 



(44) 



where fit is the i element in coefficient vector /?, and u> is the change of (3 between 
two consecutive loops, i.e., u) = f3 k — j3 k ~ l = [wi,W2, - "" i w p] ■ 

In LARS for the problem defined in Eq.39, the sign of u) is the negative gradient of 
objective function F on j3 , i.e., 



sign (u>i) = sign - 



m 

dpi 



&i=K 



(45) 



In each loop of LARS, when correlation of one active variable becomes zeros, the 
length of the coefficient path will stop increasing. Therefore, the sign vector of correla- 
tions will not change in one loop, i.e., 



sign 



m 

' d/3 t 



Pi=0t 



= sign 



dFk 

'dpi 



ft=/3, 



, F k — Fk-i , 
sign [ ) = -sign (wj 
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Fig. 1. Sample face images from the three databases. The first row comes from UMIST; the 
second row comes from FERET; and the third row comes from YALE. 



According to the analyses, we can obtain the sign of (Ff. — Fk-i): 
signfTfc - F fe _i) = -sign(cj) • sign(w) = -1. 



(46) 



According to the above equation, the objective function F is monotonic. In addition, 
F is bounded. Therefore, we can safely draw the conclusion that LARS converges in 
optimizing the problem defined in Eq.39. 



3 Experiments 



In this section, we evaluate the performance of MEN by comparing against six represen- 
tative dimensionality reduction algorithms, i.e., principal component analysis (PCA), 
Fisher's linear discriminant analysis (FLDA), discriminative locality alignment (DLA) 
(Zhang et al. 2008; Zhang et al. 2009), supervised locality preserving projection (SLPP), 
neighborhood preserving embedding (NPE), and sparse principal somponent analysis 
(SPCA), on three standard face image databases, i.e., UMIST (Graham and Allinson 
1939), FERET (Phillips et al. 2000) and YALE (Belhumeur et al. 1997). 

PCA is an unsupervised linear dimensionality reduction algorithm which projects 
the data along the direction of maximal variance. FLDA is a supervised linear dimen- 
sionality reduction method. SLPP is a supervised modification of the locality preserv- 
ing projections, which is a linearization of the Laplacian Eigenmap. NPE is a linear 
approximation to the locally linear embedding (LLE). SPCA is a sparse dimensional- 
ity reduction algorithm which combines the lasso penalty with PCA to produce sparse 
loadings. 

Three standard face image datasets, e.g., UMIST, FERET and YALE, are utilized 
in this paper to evaluate the proposed MEN for discriminative dimensionality reduc- 
tion. There are 565 face images from 20 individuals in the UMIST dataset. The samples 
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Fig. 2. Recognition Rate vs. Dimension on FERET 



demonstrate variations in race, gender, pose and appearance. The FERET dataset con- 
sists of 13, 539 face images from 1, 565 individuals. The images vary in size, gender, 
pose, illumination, facial expression and age. We randomly select 100 individuals, each 
of which has 7 images from FERET for performance evaluation. The YALE dataset 
contains 165 face images of 15 individuals. Lighting conditions, gender, facial expres- 
sions and configurations are different among these images. All images from these three 
databases are normalized to 40 x 40 pixel arrays with 256 gray levels per pixel. Fig.l 
shows sample images from these three datasets. Each image is reshaped to a long vector 
by concatenating its pixel values in a particular order. 

Different algorithms follow an equivalent procedure for all face recognition exper- 
iments on various datasets. Firstly, the database is randomly divided into two separate 
sets: training set and testing set. Then the training set is used to learn the low dimen- 
sional subspace and corresponding projection matrix through given algorithm. After 
this, samples in the testing set are projected to a low dimensional subspace via the 
projection matrix. Finally, the nearest neighbor classifier is used to recognize testing 
samples in the subspace. 

We apply PCA to reduce dimensions of original high dimensional face images be- 
fore FLDA, DLA, LPP (with supervised setting) and NPE (with supervised setting). For 
FLDA, we retain n — c dimensions in the PCA projection, where n is the number of 
samples and c is the number of classes. We project samples to the PCA subspace with 
n - 1 dimensions for DLA, SLPP and NPE. 

For UMIST and YALE, we randomly select p = (5, 7) images per individual for 
training, while the remaining images are used as testing samples. For FERET, p = (4, 5) 
images per individual are selected as training set, and the remaining for testing. All 
experiments are repeated five times, and the average recognition rates are calculated. 

The results of these dimensionality reduction algorithms on two settings of FERET 
are shown in Fig. 2. These seven algorithms can be divided into 3 groups according to 
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Fig. 3. Recognition Rate vs. Dimension on UMIST 



their performance: PCA and SPCA are at the bottom level, because they are unsuper- 
vised and the label information is not considered. PCA is slightly better than SPCA, 
because SPCA is designed to approximate PCA but with less information retained to 
hold the sparse property. LPP, NPE and LDA are at the middle level. They are much 
better than PCA and SPCA because they consider the class label information. LPP and 
NPE preserve the local geometry based on the neighborhood information of samples, 
while LDA ignores the local geometry. LPP and NPE cannot perform as well as DLA 
and MEN because both of them ignore the margin maximization or the inter-class infor- 
mation. MEN and DLA are at the top level. MEN outperforms DLA because it reduces 
the noises by using the elastic net penalty. 

Experimental results on UMIST are shown in Fig. 3. MEN outperforms the other six 
algorithms consistently. Note the fact that MEN keeps having the highest recognition 
rate when the dimension of the selected subspace is low. This verifies the robustness 
of MEN in low dimension situation. In addition, the computational cost is proportional 
to the dimension of the selected subspace. Therefore MEN produces better results with 
less computational cost than other dimensionality reduction methods. 

Fig. 4 shows MEN outperforms the other six algorithms on the YALE dataset. The 
curves of MEN are smoother than those of the other algorithms. This implicates that 
MEN is more stable than the other algorithms. MEN has high recognition rate even 
when the training set is small and the dimensions of the selected subspace is low. The 
priority of MEN can be attributes to its supervised learning property, consideration of 
data manifold structure, feature selection ability brought by sparsity and the grouping 
effect. The sparsity of MEN filters out classification irrelevant features, which bring 
unnecessary noises for classification. This is effective especially when the number of 
classes is much smaller than the number of the original features. Furthermore, the sparse 
projection matrix brings better interpretation and lower computational cost for subse- 
quent calculation than dense projection matrices. 
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Fig. 4. Recognition Rate vs. Dimension on YALE 



Table 1 lists the best recognition rate and the corresponding subspace dimension for 
each algorithm in the experiments on the three face image datasets. Sparse dimensional- 
ity reduction algorithm including MEN and SPCA always arrive their best recognition 
rate in lower dimensional subspace than other five algorithms. This is because the spar- 
sity brought by the lasso penalty is able to select the most significant features. However, 
because SPCA does not consider the class label information, it always performs more 
poorly than other supervised algorithms. For each algorithm, the dimension of the best 
recognition rate is decreasing with the increasing of training samples. This is because 
more training samples make the low dimensional representation more stable and reli- 
able. 





MEN 


DLA 


LPP 


NPE 


LDA 


PCA 


SPCA 


FERET 


4 


90.67(17) 


88.67(19) 


74.00(17) 


74.33(21) 


76.33(25) 


48.00(54) 


45.67(41) 


5 


96.50(30) 


88.50(35) 


83.50(36) 


82.00(19) 


84.00(49) 


54.00(51) 


48.50(58) 




MEN 


DLA 


LPP 


NPE 


LDA 


PCA 


SPCA 


UMIST 


5 


95.89(17) 


94.57(18) 


90.11(19) 


89.68(19) 


88.21(18) 


88.63(13) 


80.63(19) 


7 


99.21(16) 


97.62(19) 


95.40(19) 


95.17(18) 


97.24(14) 


93.79(19) 


90.57(18) 




MEN 


DLA 


LPP 


NPE 


LDA 


PCA 


SPCA 


YALE 


5 


82.78(13) 


79.11(12) 


79.33(13) 


77.11(14) 


82.22(12) 


61.11(12) 


63.33(13) 


7 


90.33(12) 


87.00(12) 


85.00(13) 


84.33(11) 


81.67(11) 


66.67(13) 


63.33(12) 



Table 1. Best recognition rate (%) on three databases. For MEN, DLA, LPP (SLPP), NPE, LDA 
(FLDA), PCA, SPCA (Sparse PCA), the numbers in the parentheses behind the recognition rates 
are the subspace dimensions. Numbers in the second column denote the number of training sam- 
ples per individual. 
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Boxplots of the experimental results of these seven dimensionality reduction algo- 
rithms on the three face image datasets are shown in Fig. 5, Fig. 6 and Fig. 7, respectively. 
Each boxplot produces a box and whisker plot for each method. The box has lines at 
the lower quartile, median, and upper quartile values. Whiskers extend from each end 
of the box to the adjacent values in the data-by default and the most extreme values 
within 1.5 times the interquartile range from the ends of the box. 

MEN achieves the most robust recognition rate, because it considers the sparse 
property, the local geometry of intra-class samples, and the margin maximization and 
classification error minimization of inter-class samples. MEN selects features with the 
largest correlation and eliminates the most unstable ones. Manifold learning methods, 
such as LPP, DLA and NPE, as well as LDA are more stable than PCA and SPCA 
according to these boxplots because they consider the class label information. 

Fig. 8, Fig. 9 and Fig. 10 show the columns of the projection matrix of the seven al- 
gorithms on the three face image datasets. The low dimensional subspace is spanned by 
the column vectors, which is called bases. The bases of PCA are called Eigenfaces (Turk 
and Pentland 1991), while the bases of LDA are called Fisherfaces (He et al. 2005) in 
previous literatures. Similar methods can be applied to DLA, SLPP, NPE, SPCA and 
MEN. The bases of MEN are sparser and have less noise than PCA and DLA because 
of its sparsity, and more grouping than SPCA because of its grouping effect adopted 
from the £2 penalty. Sparse bases lead to computational efficiency and good interpreta- 
tion. According to Fig. 8, Fig. 9 and Fig. 10, "MEN faces" retain the most discriminative 
facial features, e.g., eyebrows, eyes, nose, mouth, ears and facial contours, while leave 
the other parts blank. "SPCA faces" are sparse but without the grouping effect, their fa- 
cial contours and organs are represented by some isolate points. "LPP faces" and "NPE 
faces" are very similar in appearances and this fact well explains that they perform 
comparably in these datasets. "DLA faces" have better description of features and less 
noises than those obtained by LPP, NPE and FLDA. 

In each LARS loop of the MEN algorithm, according to the algorithm listed in 
Algorithm 1, all entries of one column in the MEN projection matrix are zeros initially. 
They are sequentially added into the active set according to their importance. The values 
of active ones are increased with equal altering correlation. In this process, the t\- 
norm of the column vector is augmented. Fig. 11 shows the altering tracks of some 
entries of the column vector in one LARS loop. We called these tracks "coefficient path" 
in LARS. In Fig.ll, every coefficient path starts from zero when the corresponding 
variable becomes active, and changes its direction when another variable is added into 
the active set. All the paths keep in the directions which make the correlations of their 
corresponding variables equally altering. The ^i-norm is increasing along the greedy 
augment of entries. The coefficient paths proceed along the gradient decent direction of 
objective function on the subspace, which is spanned by the active variables. 

Fig. 12 shows 10 of the 1600 coefficient paths from LAPS loop for the first base 
in experiment on FERET dataset. MEN selects ten important variables (facial features) 
sequentially here. Each feature, its corresponding coefficient path and the"MEN fac" 
when the feature is added into active set are assigned the same color which is different 
with the other 9 features. In each "MEN face", the new added active feature is marked 
by a small circle, and all the active features are marked by white crosses. The features 
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Fig. 5. Boxplot of Recognition Rate vs. Dimension (from 21 to 30) on FERET with 4 (5) training 
samples per person. For every dimension, from left to right, the seven boxes refer to MEN, DLA, 
LPP, NPE, FLDA, PCA, and SPCA. 
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Fig. 6. . Boxplot of Recognition Rate vs. Dimension (from 10 to 19) on UMIST with 5 (7) training 
samples per person. For every dimension, from left to right, the seven boxes refer to MEN, DLA, 
LPP, NPE, FLDA, PCA, and SPCA. 
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Fig. 7. Boxplot of Recognition Rate vs. Dimension (from 5 to 14) on YALE with 5 (7) training 
samples per person. For every dimension, from left to right, the seven boxes refer to MEN, DLA, 
LPP, NPE, FLDA, PCA, and SPCA. 
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Fig. 8. Plots of first 10 bases obtained from 7 dimensionality reduction algorithms on FERET For 
each column, from top to bottom: MEN, DLA, LPP, NPE, FLDA, PCA, and SPCA 
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Fig. 9. Plots of first 10 bases obtained from 7 dimensionality reduction algorithms on UMIST For 
each column, from top to bottom: MEN, DLA, LPP, NPE, FLDA, PCA, and SPCA 
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Fig. 10. Plots of first 10 bases obtained from 7 dimensionality reduction algorithms on YALE For 
each column, from top to bottom: MEN, DLA, LPP, NPE, FLDA, PCA, and SPCA 
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Fig. 11. Entries of one column of projection matrix vs. its £i-norm in one LARS loop of MEN 



selected by MEN can produce explicit interpretation of the relationship between facial 
features and face recognition: feature 1 is the left ear, feature 2 is the top of nose, feature 

3 is on the head contour, feature 4 is the mouth, feature 5 and feature 6 are on the left 
eye, feature 7 is the right ear, and feature 8 is the left corner of mouth. These features 
are already verified of great importance in face recognition by many other famous face 
recognition methods. Moreover, Fig. 12 also shows MEN can group correlated features, 
for example, feature 5 and feature 6 are selected sequentially because they are both 
on the left eye. In addition, features which are not very important, such as feature 9 
and feature 10 in Fig. 12, are selected after the selection of the other more significant 
features and assigned smaller value than those more important ones. Therefore, MEN 
is a powerful algorithm in variable (feature) selection. 

4 Conclusion 



In this paper, we propose a unifying framework which obtains a sparse projection ma- 
trix for subsequent classification, termed manifold elastic net or MEN for short. MEN 
incorporates the advantages of both manifold learning based dimensionality reduction 
and sparse learning based dimensionality reduction, but it is not a direct combination 
of these two. To obtain a sparse projection matrix, MEN imposes the elastic net penalty 
over a loss function that is defined under the patch alignment framework. The objective 
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Fig. 12. Coefficient paths of 10 entries (features) in one column vector 



function of MEN can be transformed into a lasso penalized least square problem by 
using a series of complex linear algebra equivalent transformations, and thus the least 
angle regression (LARS) can be applied to obtain the optimal sparse projection matrix. 

In MEN, the patch alignment framework is first used to construct local patches of 
data and unifies these patches into a global coordinate system. Secondly, the classifica- 
tion error is minimized directly via weighted principal component analysis (PCA) over 
class centers. Thirdly, to obtain a sparse projection matrix with the grouping effect, the 
elastic net penalty is added to the objective function. After a series of equivalent trans- 
formations, MEN can be rewritten as a lasso-type regression. Therefore, LARS can be 
applied to solve the problem efficiently. In each LARS loop for MEN optimization, 
important variables are added into the active set sequentially according to their correla- 
tion. All the elements in the active set are altered along a special direction with a special 
distance in each step. The special direction and distance keep the correlation of active 
elements identical and the largest in a LARS loop. The procedure is conducted several 
times to obtain a set of sparse bases because these bases are independent. 

MEN enjoys advantages in several aspects: 1) the local geometry of intra-class sam- 
ples is well preserved for low dimensional data representation, 2) both the margin max- 
imization and the classification error minimization are considered for discriminative 
information preservation, 3) the sparsity of the projection matrix of MEN improves the 
parsimony in computation, 4) the elastic net penalty reduces the over-fitting problem, 
and 5) the projection matrix of MEN can be interpreted psychologically and physiolog- 
ically. 

Experimental results of face recognition on UMIST, FERET and YALE show that 
MEN performs better and more stable than popular dimensionality reduction algo- 
rithms, such as the principal component analysis (PCA), Fisher's linear discriminant 
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analysis (FLDA), the discriminative locality alignment (DLA), the locality preserv- 
ing projections with supervised setting (LPP), the neighborhood preserving embedding 
with supervised setting (NPE), and the sparse principal component analysis (SPCA). 

There are still many interesting properties of MEN which have not been targeted 
and formally proved in this paper. In the future, we will analyze its error bounds under 
different situations. Another important problem in MEN is how to choose the optimal 
sparsity, so that we can remove most noise and retain most discriminative information 
for subsequent classification. The compressed sensing may be an effective tool to ad- 
dress the above concern. It is also valuable to replace the lasso penalty with the 4)-norm 
penalty to further improve MEN with more "accurate sparsity". The lasso penalty is a 
relaxation of ^o- norm penalty, and there are alternatives which could perform better 
than the lasso penalty, e.g., the smoothly clipped absolute deviation penalty (SCAD) 
(Fan and Li 2001), the reweighted i\ minimization (Candes et al. 2008), the adaptive 
lasso (Zou 2006b) and the adaptive elastic net (Zou and Zhang 2009). The advantages of 
these methods can be adopted in MEN to further enhance the variable selection ability 
of MEN, and there is still a long way to go. 
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